The goal is to identify the individuals who are most likely to click on ads.
The analysis will be successful if it identifies the individuals who are likely to click on the ads.
A Kenyan entrepreneur has created an online cryptography course and wants to advertise it on her blog.
She currently targets audiences originating from various countries. In the past, she ran ads to advertise a related course on the same blog and collected data in the process.
She has employed a Data Science Consultant to help her identify which individuals are most likely to click on her ads.
library('data.table')
library('tidyverse')
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
✔ ggplot2 3.3.3 ✔ purrr 0.3.4
✔ tibble 3.0.4 ✔ dplyr 1.0.2
✔ tidyr 1.1.2 ✔ stringr 1.4.0
✔ readr 1.4.0 ✔ forcats 0.5.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::between() masks data.table::between()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::first() masks data.table::first()
✖ dplyr::lag() masks stats::lag()
✖ dplyr::last() masks data.table::last()
✖ purrr::transpose() masks data.table::transpose()
library('chron') # For working with datetime
advertising <- fread('http://bit.ly/IPAdvertisingData')
Column descriptions
‘Daily Time Spent on Site’: consumer time on site in minutes
‘Age’: customer age in years
‘Area Income’: Avg. Income of geographical area of consumer
‘Daily Internet Usage’: Avg. minutes a day consumer is on the internet
‘Ad Topic Line’: Headline of the advertisement
‘City’: City of consumer
‘Male’: Whether or not consumer was male
‘Country’: Country of consumer
‘Timestamp’: Time at which consumer clicked on Ad or closed window
‘Clicked on Ad’: 0 or 1, indicating whether the consumer clicked on the Ad
The following interpretation was obtained from a Kaggle discussion:
the timestamps record the times at which users entered or left the site, so a ‘Clicked on Ad’ value of 0 implies the user was leaving the site, while 1 implies the user was entering the site
Previewing the top of the dataset
head(advertising)
Checking the data types of the columns in the dataset
str(advertising)
Classes 'data.table' and 'data.frame': 1000 obs. of 10 variables:
$ Daily Time Spent on Site: num 69 80.2 69.5 74.2 68.4 ...
$ Age : int 35 31 26 29 35 23 33 48 30 20 ...
$ Area Income : num 61834 68442 59786 54806 73890 ...
$ Daily Internet Usage : num 256 194 236 246 226 ...
$ Ad Topic Line : chr "Cloned 5thgeneration orchestration" "Monitored national standardization" "Organic bottom-line service-desk" "Triple-buffered reciprocal time-frame" ...
$ City : chr "Wrightburgh" "West Jodi" "Davidton" "West Terrifurt" ...
$ Male : int 0 1 0 1 0 1 0 1 1 1 ...
$ Country : chr "Tunisia" "Nauru" "San Marino" "Italy" ...
$ Timestamp : chr "2016-03-27 00:53:11" "2016-04-04 01:39:02" "2016-03-13 20:35:42" "2016-01-10 02:31:19" ...
$ Clicked on Ad : int 0 0 0 0 0 0 0 1 0 0 ...
- attr(*, ".internal.selfref")=<externalptr>
Summary of the dataset
summary(advertising)
Daily Time Spent on Site Age Area Income Daily Internet Usage
Min. :32.60 Min. :19.00 Min. :13996 Min. :104.8
1st Qu.:51.36 1st Qu.:29.00 1st Qu.:47032 1st Qu.:138.8
Median :68.22 Median :35.00 Median :57012 Median :183.1
Mean :65.00 Mean :36.01 Mean :55000 Mean :180.0
3rd Qu.:78.55 3rd Qu.:42.00 3rd Qu.:65471 3rd Qu.:218.8
Max. :91.43 Max. :61.00 Max. :79485 Max. :270.0
Ad Topic Line City Male Country
Length:1000 Length:1000 Min. :0.000 Length:1000
Class :character Class :character 1st Qu.:0.000 Class :character
Mode :character Mode :character Median :0.000 Mode :character
Mean :0.481
3rd Qu.:1.000
Max. :1.000
Timestamp Clicked on Ad
Length:1000 Min. :0.0
Class :character 1st Qu.:0.0
Mode :character Median :0.5
Mean :0.5
3rd Qu.:1.0
Max. :1.0
Checking for null values
colSums(is.na(advertising))
Daily Time Spent on Site Age Area Income
0 0 0
Daily Internet Usage Ad Topic Line City
0 0 0
Male Country Timestamp
0 0 0
Clicked on Ad
0
There were no null values in the dataset
Checking for Duplicates
dim(advertising[duplicated(advertising)])
[1] 0 10
There were no duplicates found
Splitting the timestamp column into date and time
time_stamp <- advertising$Timestamp
parts <- t(as.data.frame(strsplit(time_stamp,' ')))
advertising$dates <- as.Date(parts[,1]) #saving dates
advertising$times <- as.times(parts[,2])#saving time
# view(advertising)
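An alternative to splitting the string by hand is to parse the timestamp once and derive the parts from it. The sketch below uses base R only; the two sample timestamps are copied from the `str()` output above, and the `UTC` time zone is an assumption, since the dataset does not state one.

```r
# Two timestamps copied from the str() output above
stamps <- c("2016-03-27 00:53:11", "2016-04-04 01:39:02")

# Parse once; the time zone is an assumption (the data does not give one)
parsed <- as.POSIXct(stamps, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

dates <- as.Date(parsed)                    # calendar date
hours <- as.integer(format(parsed, "%H"))   # hour of day, 0 to 23

dates
hours
```

The hour of day is often easier to tabulate or plot than a full time value.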
age <- advertising$Age
area_income <- advertising$`Area Income`
time_on_site <- advertising$`Daily Time Spent on Site`
internet_usage <- advertising$`Daily Internet Usage`
gender <- as.character(advertising$Male)
ad <- as.character(advertising$`Clicked on Ad`)
date <- advertising$dates
time <- advertising$times
country <- advertising$Country
city <- advertising$City
ad_topic <- advertising$`Ad Topic Line`
boxplot(time_on_site)$out
numeric(0)
No outliers in amount of time spent on the site
boxplot(age)$out
numeric(0)
# outlier(advertising$Age)
No outliers in the ages of the users
boxplot(area_income)$out
[1] 17709.98 18819.34 15598.29 15879.10 14548.06 13996.50 14775.50 18368.57
# outlier(advertising$`Area Income`)
There are a few outliers in area income, all at the low end. Removing them could discard valuable information about lower-income areas, so they are kept.
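For context, `boxplot()` flags points that lie more than 1.5 times the interquartile range beyond the box. A rough check with the quartiles from the `summary()` output above shows why only low incomes are flagged. The quartiles are rounded there, and `boxplot()` uses hinges that can differ slightly from them, so the fences below are approximate.

```r
# Quartiles of Area Income as printed (rounded) by summary()
q1 <- 47032
q3 <- 65471
iqr <- q3 - q1

# Approximate fences used by the boxplot rule
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values reported by boxplot(area_income)$out
flagged <- c(17709.98, 18819.34, 15598.29, 15879.10,
             14548.06, 13996.50, 14775.50, 18368.57)

lower_fence
upper_fence
all(flagged < lower_fence)   # every flagged value sits below the lower fence
```

The maximum area income (79,485) is well below the upper fence, which is why no high-income areas are flagged.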
boxplot(internet_usage)$out
numeric(0)
# outlier(advertising$Age)
There were no outliers in the internet usage time
boxplot(time)
There were no outliers in the time users were accessing or leaving the site
boxplot(date)
There were no outliers in the dates users were accessing or leaving the site
Univariate Analysis
get.mode <- function(v){
uniq <- unique(v)
# gets all the unique values in the column
# match (v, uniq) matches a value to the unique values and returns the index
# tabulate (match (v, uniq)) takes the values in uniq and counts the number of times each integer occurs in it.
# which.max() gets the index of the first maximum in the tabulated list
# then prints out the uniq value
uniq[ which.max (tabulate (match (v, uniq)))]
}
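A quick check of how `get.mode()` behaves, using made-up vectors (not the advertising data). Because `which.max()` returns the first maximum, ties resolve to the value seen first, and a vector with no repeats simply returns its first element, so the mode of a continuous column should be read with care.

```r
# Same function as above, repeated so this check runs on its own
get.mode <- function(v) {
  uniq <- unique(v)
  uniq[which.max(tabulate(match(v, uniq)))]
}

clear_mode <- get.mode(c(4, 7, 7, 2, 7))   # 7 occurs three times
tied_mode  <- get.mode(c(3, 1, 3, 2, 1))   # 3 and 1 tie; 3 is seen first
no_repeats <- get.mode(c(5.2, 9.1, 6.3))   # no repeats; first element comes back

c(clear_mode, tied_mode, no_repeats)
```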
mean(date); median(date); get.mode(date)
[1] "2016-04-09"
[1] "2016-04-07"
[1] "2016-04-04"
Activity was fairly balanced over the period: the mean date (April 9) and the median date (April 7) are close together. The single busiest day was April 4.
max(date); min(date)
[1] "2016-07-24"
[1] "2016-01-01"
The dates on which users were accessing or leaving the site ranged from January 1, 2016 to July 24, 2016
The countries with the most consumers
table.country <- table(country) # creates a frequency table
view(table.country)#viewing the table
table.country <- table.country[order(-table.country)] # re-ordering the table
head(table.country,10) # previewing the ordered table
country
Czech Republic France Afghanistan Australia Cyprus
9 9 8 8 8
Greece Liberia Micronesia Peru Senegal
8 8 8 8 8
The countries with the least consumers
tail(table.country)
country
Marshall Islands Montserrat Mozambique
1 1 1
Romania Saint Kitts and Nevis Slovenia
1 1 1
Countries where there were the most ad clicks
only.ad <- country[ad==1]
table.country.ad <- table(only.ad)
view(table.country.ad)
table.country.ad <- table.country.ad[order(-table.country.ad)]
head(table.country.ad,10)
only.ad
Australia Ethiopia Turkey Liberia Liechtenstein
7 7 7 6 6
South Africa Afghanistan France Hungary Mayotte
6 5 5 5 5
Cities with the most activity on the site
table.city <- table(city)
view(table.city)
table.city <- table.city[order(-table.city)]
head(table.city,10)
city
Lisamouth Williamsport Benjaminchester East John East Timothy
3 3 2 2 2
Johnstad Joneston Lake David Lake James Lake Jose
2 2 2 2 2
only.ad <- city[ad==1]
table.city.ad <- table(only.ad)
view(table.city.ad)
table.city.ad <- table.city.ad[order(-table.city.ad)]
head(table.city.ad,10)
only.ad
Lake David Lake James Lisamouth Michelleside Millerbury Robertfurt
2 2 2 2 2 2
South Lisa West Amanda West Shannon Williamsport
2 2 2 2
mean(age); median(age); get.mode(age)
[1] 36.009
[1] 35
[1] 31
The most common age was 31 years, the median 35 and the mean 36. Since the mean is above the median, the distribution is skewed to the right (a longer tail toward older ages)
max(age); min(age)
[1] 61
[1] 19
Ages ranged from 19 to 61 years
quantile(age,probs=c(0.05,0.95))
5% 95%
23.95 52.00
About 90% of consumers were between roughly 24 and 52 years old (5th to 95th percentile)
var(age); sd(age)
[1] 77.18611
[1] 8.785562
ggplot(advertising,aes(age))+ geom_density()
The ages are skewed to the right: most consumers are on the younger side, with a longer tail toward older ages
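The direction of skew can be checked numerically rather than read off the density plot. The sketch below uses a moment-based skewness; the `skewness()` helper and the sample vector are illustrative assumptions, not part of the original analysis. A positive value means a longer right tail, and such data usually have a mean above the median.

```r
# Moment-based sample skewness: positive means a longer right tail
skewness <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / (mean(centred^2)^1.5)
}

# Illustrative ages (made up, not the dataset)
x <- c(21, 23, 25, 26, 28, 30, 31, 33, 45, 58)

mean(x)      # pulled upward by the few large values
median(x)
skewness(x)  # positive for this vector
```

Calling `skewness(age)` on the real column would give the sign for the ages in the dataset.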
mean(time_on_site); median(time_on_site); get.mode(time_on_site)
[1] 65.0002
[1] 68.215
[1] 62.26
The average time on site was 65 minutes, and the most common value was 62.26 minutes.
max(time_on_site); min(time_on_site)
[1] 91.43
[1] 32.6
Time on the site ranged from 32.6 to 91.43 minutes
quantile(time_on_site,probs=c(0.05,0.95))
5% 95%
37.5765 86.1995
About 90% of consumers spent between 37.6 and 86.2 minutes on the site.
var(time_on_site); sd(time_on_site)
[1] 251.3371
[1] 15.85361
ggplot(advertising,aes(time_on_site))+ geom_histogram(fill='#222222')
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Time on site has two peaks, at around 40 and 80 minutes, suggesting two classes of consumers: those who spend a longer time on the site and those who spend less.
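The `stat_bin()` message appears because no bin width was given. A sketch of setting it explicitly follows; the width of 5 minutes and the simulated `demo` data frame are arbitrary assumptions for illustration, standing in for the real column.

```r
library(ggplot2)

set.seed(42)
# Simulated stand-in for time_on_site: two groups of visitors
demo <- data.frame(time_on_site = c(rnorm(100, 45, 6), rnorm(100, 80, 6)))

# An explicit binwidth (in minutes) silences the message and fixes the resolution
p <- ggplot(demo, aes(time_on_site)) +
  geom_histogram(binwidth = 5, fill = "#222222", colour = "white")
p
```

Trying a few widths is a cheap way to confirm that two peaks are a feature of the data and not of the binning.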
mean(time); median(time); get.mode(time)
[1] 12:09:09
[1] 12:05:51
[1] 17:39:06
The average time at which consumers accessed or left the site was around noon, and the most common time was 5:39 pm
max(time); min(time)
[1] 23:59:06
[1] 00:00:48
Access times spanned the whole day (24 hours)
quantile(time,probs=c(0.05,0.95))
5% 95%
01:13:50 22:49:45
About 90% of visits fell between 01:13 and 22:49
mean(area_income); median(area_income); get.mode(area_income)
[1] 55000
[1] 57012.3
[1] 61833.9
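The mean area income was 55,000 and the median 57,012.3. Because the mean is below the median, the distribution has a longer tail toward lower incomes, so more than half of the consumers are above the mean. The reported mode (61,833.9) is the value in the first row of the data; for a continuous variable whose values rarely repeat, get.mode() likely just returns the first of many tied values, so it is not informative here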
max(area_income);min(area_income)
[1] 79484.8
[1] 13996.5
Area incomes ranged from 13,996.5 to 79,484.8
quantile(area_income,probs=c(0.05,0.95))
5% 95%
28275.30 73600.72
About 90% of consumers lived in areas with incomes between 28,275.30 and 73,600.72
var(area_income); sd(area_income)
[1] 179952406
[1] 13414.63
ggplot(advertising,aes(area_income))+ geom_density()
The density plot is skewed to the left (a longer tail toward lower incomes), consistent with the mean lying below the median: most consumers are in the higher income brackets
ggplot(advertising,aes(area_income))+ geom_histogram(fill = "#222222", colour = "#038b8d")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
mean(internet_usage); median(internet_usage); get.mode(internet_usage)
[1] 180.0001
[1] 183.13
[1] 167.22
The average daily internet usage was 180 minutes, while the most common value was 167.22 minutes
max(internet_usage); min(internet_usage)
[1] 269.96
[1] 104.78
The range of the time spent was from 104.78 to 269.96 minutes
quantile(internet_usage,probs=c(0.05,0.95))
5% 95%
113.5095 246.7345
About 90% of consumers spent between 113.5 and 246.7 minutes a day on the internet
var(internet_usage); sd(internet_usage)
[1] 1927.415
[1] 43.90234
ggplot(advertising,aes(internet_usage))+ geom_density()
There were two peaks, at around 125 and 225 minutes, showing two groups of people who spend different amounts of time on the internet
ggplot(advertising,aes(internet_usage))+ geom_histogram(fill = "#222222", colour = "#038b8d")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(advertising,aes(gender))+ geom_bar()
Slightly more females than males accessed the site (the mean of ‘Male’ is 0.481, so about 52% were female)
ggplot(advertising,aes(ad))+ geom_bar(fill='#222222')
The dataset is evenly split between people who clicked an ad and people who did not
# library('viridis')
ggplot(advertising,aes(gender,fill=ad))+ geom_bar()
Most females accessing the site had clicked an ad while most males visiting the site had not clicked an ad
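A stacked bar chart can be backed up with a cross-tabulation, which gives the exact counts and within-group shares. A minimal sketch with made-up vectors coded the same way as `gender` and `ad` ("0"/"1" strings); the values are illustrative only.

```r
# Made-up stand-ins for the gender and ad vectors ("1" = male, "1" = clicked)
gender <- c("0", "0", "0", "1", "1", "1", "1")
ad     <- c("1", "1", "0", "0", "0", "1", "0")

counts <- table(gender, ad)               # rows: gender, columns: clicked or not
shares <- prop.table(counts, margin = 1)  # share of clicks within each gender

counts
round(shares, 2)
```

With the real vectors, the row for each gender shows directly whether clickers are the majority in that group.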
ggplot(advertising,aes(internet_usage,time_on_site))+ geom_point(alpha=0.5)+
geom_quantile(size=1 ,alpha = 1,color="#1abc9c")
Smoothing formula not specified. Using: y ~ x
People who spent more time on the internet tended to stay longer on the site
ggplot(advertising,aes(internet_usage,time_on_site,color=ad))+ geom_point(alpha=0.75)+
geom_quantile(size=0.9 ,alpha = 1,quantiles=c(0.25,0.5,0.75))
Smoothing formula not specified. Using: y ~ x
Smoothing formula not specified. Using: y ~ x
Most people who clicked on an ad spent less time on the site and on the internet than those who did not click an ad. However, within each group, consumers spent less time on the site the longer they spent on the internet
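Group differences like this one can also be summarised with per-group medians. A sketch using `aggregate()` on a small made-up data frame; the column names mirror the vectors used above, but the values are illustrative assumptions.

```r
# Made-up example: "1" = clicked an ad, "0" = did not
demo <- data.frame(
  ad             = c("0", "0", "0", "1", "1", "1"),
  time_on_site   = c(78, 82, 75, 50, 55, 47),
  internet_usage = c(220, 240, 210, 140, 150, 130)
)

# Median of both measures within each ad group
medians <- aggregate(cbind(time_on_site, internet_usage) ~ ad,
                     data = demo, FUN = median)
medians
```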
ggplot(advertising,aes(age,internet_usage))+ geom_point(alpha=0.5)+
geom_quantile(size=1 ,alpha = 1,color="#1abc9c")
Smoothing formula not specified. Using: y ~ x
There was a decline in internet usage as consumers got older.
ggplot(advertising,aes(age,internet_usage,color=ad))+ geom_point(alpha=0.75)+
geom_quantile(size=0.9 ,alpha = 1,quantiles=c(0.25,0.5,0.75))
Smoothing formula not specified. Using: y ~ x
Smoothing formula not specified. Using: y ~ x
Among those who exited the site, internet usage increased with age. Among those who clicked, internet usage was fairly constant, with a slight decline with age; moreover, most of them were 35 years and above (an older generation)
ggplot(advertising,aes(age,time_on_site))+ geom_point(alpha=0.5)+
geom_quantile(size=1 ,alpha = 1,color="#1abc9c")
Smoothing formula not specified. Using: y ~ x
Time on site went down the older the consumer got. Content may be geared towards a younger demographic.
ggplot(advertising,aes(age,time_on_site,color=ad))+ geom_point(alpha=0.75)+
geom_quantile(size=0.9 ,alpha = 1,quantiles=c(0.25,0.5,0.75))
Smoothing formula not specified. Using: y ~ x
Smoothing formula not specified. Using: y ~ x
For consumers leaving the site, time on site increased with age; the content could be more relevant to consumers around 30 years old, or they may be loyal to the site. For those who clicked the ad, time on site was fairly constant across ages, at around 52 minutes.
ggplot(advertising,aes(area_income, fill=ad))+ geom_histogram(color='black') # constant outline colour belongs outside aes()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Most people who clicked on the ads came from areas with incomes between 40,000 and 60,000. In areas with incomes of 50,000 and above, people leaving the site outnumbered those coming into the site through ads
ggplot(advertising,aes(age,area_income))+ geom_point(alpha=0.5)+
geom_quantile(size=1 ,alpha = 1,color="#1abc9c")
Smoothing formula not specified. Using: y ~ x
Area income decreased as age increased
ggplot(advertising,aes(area_income,age,color=ad))+ geom_point(alpha=0.75)+
geom_quantile(size=0.9 ,alpha = 1,quantiles=c(0.25,0.5,0.75))
Smoothing formula not specified. Using: y ~ x
Smoothing formula not specified. Using: y ~ x
Among those leaving the site, area income increased with age, while among those who clicked on ads it decreased slightly with age. The average age of those clicking the ad was about 40
ggplot(advertising,aes(time_on_site,area_income))+ geom_point(alpha=0.5)+
geom_quantile(size=1 ,alpha = 1,color="#1abc9c")
Smoothing formula not specified. Using: y ~ x
Time on site increased with area income.
ggplot(advertising,aes(time_on_site,area_income,color=ad))+ geom_point(alpha=0.75)+
geom_quantile(size=0.9 ,alpha = 1,quantiles=c(0.25,0.5,0.75))
Smoothing formula not specified. Using: y ~ x
Smoothing formula not specified. Using: y ~ x
Area income was fairly constant for those clicking ads, so it may not have much of an impact on their time on site. However, time on site increases as area income decreases, meaning it could have an impact on the retention of consumers
advert <- subset(advertising, select = c(1:4,10,12))
cov(advert)
Daily Time Spent on Site Age Area Income
Daily Time Spent on Site 2.513371e+02 -46.174146 6.613081e+04
Age -4.617415e+01 77.186105 -2.152093e+04
Area Income 6.613081e+04 -21520.925797 1.799524e+08
Daily Internet Usage 3.609919e+02 -141.634816 1.987625e+05
Clicked on Ad -5.933143e+00 2.164665 -3.195989e+03
times 8.465934e-05 -0.130389 1.347223e+02
Daily Internet Usage Clicked on Ad times
Daily Time Spent on Site 3.609919e+02 -5.933143e+00 8.465934e-05
Age -1.416348e+02 2.164665e+00 -1.303890e-01
Area Income 1.987625e+05 -3.195989e+03 1.347223e+02
Daily Internet Usage 1.927415e+03 -1.727409e+01 9.525717e-01
Clicked on Ad -1.727409e+01 2.502503e-01 -6.747326e-03
times 9.525717e-01 -6.747326e-03 8.412110e-02
heatmap(cor(advert),Rowv = NA, Colv = NA,scale = "none", margins = c(10,10)) # scale = "none" keeps the colours equal to the correlations
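Clicking an ad was positively related to the age of the consumer (a correlation of about 0.49, computed from the covariances above); its relationships with daily internet usage and daily time on site were stronger and negative (about -0.79 and -0.75).
There is a positive relationship between daily internet usage and both time on site and area income.
As daily internet usage increased, so did time on site.
As area income increased, time on site increased.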
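Covariances depend on the units of each variable, so the strength of a relationship is easier to judge from correlations. The sketch below rebuilds part of the covariance matrix from the `cov(advert)` output above (values copied by hand, `times` omitted, short row names chosen here for readability) and rescales it with `cov2cor()`.

```r
vars <- c("time_on_site", "age", "area_income", "internet_usage", "clicked")

# Covariances copied from the cov(advert) output (times column left out)
cov_mat <- matrix(
  c(251.3371,   -46.174146,     66130.81,       360.9919,    -5.933143,
    -46.174146,  77.186105,    -21520.925797,  -141.634816,   2.164665,
    66130.81,   -21520.925797,  1.799524e8,     198762.5,    -3195.989,
    360.9919,   -141.634816,    198762.5,       1927.415,    -17.27409,
    -5.933143,   2.164665,     -3195.989,      -17.27409,     0.2502503),
  nrow = 5, byrow = TRUE, dimnames = list(vars, vars)
)

# Divide each covariance by the product of the two standard deviations
cor_mat <- cov2cor(cov_mat)
round(cor_mat["clicked", ], 2)
```

With these values, the correlations with `Clicked on Ad` come out at roughly -0.79 for internet usage, -0.75 for time on site, 0.49 for age and -0.48 for area income.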
The task could probably have been performed better with a machine learning model complementing the analysis.
Consumers from the following cities recorded the most ad clicks (two each): Lake David, Lake James, Lisamouth, Michelleside, Millerbury, Robertfurt, South Lisa, West Amanda, West Shannon, Williamsport
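As one possible shape for such a model, a logistic regression can be fitted with base R's `glm()`. This is only a sketch on simulated data: the `usage` and `clicked` vectors and the coefficients used to generate them are assumptions made up for illustration, not results from the advertising dataset.

```r
set.seed(1)
n <- 200

# Simulated data: higher daily internet usage makes a click less likely
usage   <- runif(n, 100, 270)
clicked <- rbinom(n, 1, plogis(8 - 0.045 * usage))

# Logistic regression of the click indicator on internet usage
fit <- glm(clicked ~ usage, family = binomial)
coef(fit)

# Predicted click probability for a light and a heavy internet user
pred <- predict(fit, newdata = data.frame(usage = c(120, 250)),
                type = "response")
pred
```

With the real data, the same call would take the `Clicked on Ad` column as the response and the numeric columns as predictors.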